data from https://data.rijksmuseum.nl/object-metadata/download/
This Comma Separated Values file (202020-rma-csv-collection.zip) provides a simple inventory of objects in the Rijksmuseum collection. It includes the object number and persistent identifier, as well as a single title, type, creator, date and image URL for each object.
## [1] 5
The original data has the following column names:
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.6 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
data <- read_csv("rma-csv-collection.csv")
## Rows: 667894 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): objectInventoryNumber, objectPersistentIdentifier, objectTitle[1], ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
# A tibble: 667,894 × 7
Columns 3, 4, 5 and 6 should be kept. Inspecting objectCreationDate[1] shows many NA values and non-numeric strings. Because the original file is too large (over 100 MB), it is downsized by keeping only every 50th row:
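data <- data %>%
  # keep columns 3 to 6 under shorter names and record the original row position
  transmute(
    title = data[[3]],
    type = data[[4]],
    creator = data[[5]],
    year = data[[6]],
    index = row_number()
  ) %>%
  # filter out irregular years such as 0000-00 (6 or more characters)
  # and keep every 50th row to shrink the file
  filter(nchar(year) < 6, index %% 50 == 0)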
write_csv(data, "rma-downsize.csv")
data
Current file sizes in the project directory:
## total 139M
## -rwxrwxrwx 1 alexm alexm 1.3M Nov 17 17:35 project-description.nb.html
## -rwxrwxrwx 1 alexm alexm 2.5K Nov 18 23:36 project-description.Rmd
## -rwxrwxrwx 1 alexm alexm 137M Nov 18 22:08 rma-csv-collection.csv
## -rwxrwxrwx 1 alexm alexm 955K Nov 18 23:36 rma-downsize.csv
## -rwxrwxrwx 1 alexm alexm 13K Nov 13 17:54 sample.png
data <- read_csv("rma-downsize.csv")
## Rows: 11997 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): title, type, creator
## dbl (2): year, index
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
The goal is to explore how the top 7 object types change over the years.
## [1] 1 5830
Reorder the types by count and choose the top 7 (or, if the counts of the top 7 are too close to each other, choose the 1st, 21st, 41st, ... in the ranked list).
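A minimal sketch of that ranking step, run on a made-up toy table rather than the real collection. The `toy` data, the `n_top` value and the helper names `type_counts`, `top_types`, `spaced_types` and `toy_top` are assumptions for illustration only; `count()`, `slice_max()`, `fct_infreq()` and `fct_lump_n()` are the dplyr and forcats functions used.

```r
library(tidyverse)

# Toy stand-in for the downsized data; the type values are made up.
toy <- tibble(
  type = c("print", "print", "print", "drawing", "drawing", "photo", "coin"),
  year = c(1650, 1700, 1880, 1600, 1880, 1900, 1368)
)

# Count each type and sort from most to least common.
type_counts <- toy %>%
  count(type, sort = TRUE)

# Keep the n_top most common types (7 in the plan; 2 here because the toy data is tiny).
n_top <- 2
top_types <- type_counts %>%
  slice_max(n, n = n_top, with_ties = FALSE) %>%
  pull(type)

# Alternative from the plan: take every 20th type in the ranked list (1st, 21st, 41st, ...).
spaced_types <- type_counts %>%
  slice(seq(1, n(), by = 20)) %>%
  pull(type)

# Order the factor levels by frequency and lump all remaining types into "Other".
toy_top <- toy %>%
  mutate(type = fct_lump_n(fct_infreq(type), n = n_top))
```

Lumping the remaining types into "Other" keeps every row in the data, whereas filtering on `top_types` would drop them; which is preferable depends on whether the plot should show the remainder.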
str(data$year) # was chr [1:667894] in the original file; numeric after re-reading the downsized file
## num [1:11997] 1880 1650 -1600 1600 1368 ...
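In the original data, year contains strings and NA. To clean the data:
- filter out rows whose year is not a number
- try to keep the x axis continuous; if that does not work, bin the years into ranges (note that fct_lump() lumps rare factor levels together, so a binning function such as cut() is the better fit here)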
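A minimal sketch of the year clean-up, run on made-up values that mimic the raw text column. The `raw` and `clean` names, the 100-year bin width and the use of base R `cut()` for the ranges are assumptions for illustration, not part of the original plan.

```r
library(tidyverse)

# Toy year column stored as text, as in the raw file; the values are made up.
raw <- tibble(year = c("1880", "1650", "-1600", "0000-00", "ca. 1700", NA))

clean <- raw %>%
  # as.numeric() yields NA for anything that is not a plain number;
  # suppressWarnings() hides the coercion warning raised for those values.
  mutate(year = suppressWarnings(as.numeric(year))) %>%
  # Drop rows whose year could not be parsed (including the original NA).
  filter(!is.na(year)) %>%
  # Fallback if a continuous axis does not work: bin the years into 100-year ranges.
  mutate(
    century = cut(
      year,
      breaks = seq(-1600, 1900, by = 100),
      include.lowest = TRUE,
      dig.lab = 4
    )
  )
```

On the real data the `breaks` would need to span the actual range of years, for example built from `min(year)` and `max(year)`.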
The visualization will be something like the sample, but hopefully fancier and with annotations that stay in the right places.
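One possible way to keep annotations in the right places is to position them from the data instead of hard-coding coordinates. The sketch below assumes a line chart of counts per type over time; the `counts` table is made up, and `labels` and `p` are hypothetical names.

```r
library(tidyverse)

# Made-up counts per type and century, standing in for the summarised data.
counts <- tribble(
  ~type,     ~century, ~n,
  "print",   1600,     40,
  "print",   1700,     65,
  "print",   1800,     90,
  "drawing", 1600,     25,
  "drawing", 1700,     30,
  "drawing", 1800,     20
)

# Take the last point of each line so every label follows its own data.
labels <- counts %>%
  group_by(type) %>%
  slice_max(century, n = 1) %>%
  ungroup()

p <- ggplot(counts, aes(x = century, y = n, colour = type)) +
  geom_line() +
  # Place each type name just to the right of the end of its line.
  geom_text(data = labels, aes(label = type), hjust = -0.1, show.legend = FALSE) +
  # Add room on the right so the labels are not clipped.
  scale_x_continuous(expand = expansion(mult = c(0.05, 0.2))) +
  labs(x = "Year", y = "Number of objects", colour = "Type")
```

Because the label positions are computed from `counts`, they move with the lines when the data or the axis limits change.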